Fix Azure and GCP INFRA WORKLOAD configuration related issues #149

Conversation

qiliRedHat
Contributor

@qiliRedHat qiliRedHat commented May 16, 2022

Fix #148 and #153
Summary of changes:

  1. Fix the hard-coded location on Azure.
  2. Update the Azure storage class to the default storage class, managed-csi.
  3. Add a comment to the README that Azure only supports region: centralus.
  4. Add the cluster name as a prefix to machineset names for all providers; this helps identify the owner of the resources.
  5. Add infra and workload labels to the metadata of GCP machinesets. This fixes #153 (INFRA and WORKLOAD machinesets are scaled because they have no infra and workload label).

@openshift-ci openshift-ci bot requested review from memodi and skordas May 16, 2022 09:16
@qiliRedHat qiliRedHat changed the title Cluster post config fix aure location Cluster post config fix Aure INFRA WORKLOAD name location and zone May 16, 2022
@qiliRedHat qiliRedHat force-pushed the cluster-post-config-fix-aure-location branch from 79ff985 to a83b4ac Compare May 16, 2022 10:37
@qiliRedHat
Contributor Author

Tested successfully with job https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/scale-ci/job/qili-e2e-benchmark/job/cluster-post-config-fix-aure-location/8/
Infra machines are now provisioned and spread across 3 zones.

% oc get machines -A
NAMESPACE               NAME                                                  PHASE     TYPE               REGION      ZONE   AGE
openshift-machine-api   infra-centralus1-plt57                                Running   Standard_D48s_v3   centralus   1      6m2s
openshift-machine-api   infra-centralus2-bfqlt                                Running   Standard_D48s_v3   centralus   2      6m2s
openshift-machine-api   infra-centralus3-7jqnt                                Running   Standard_D48s_v3   centralus   3      6m2s
openshift-machine-api   qili-preserve-az-0516-8lmg7-master-0                  Running   Standard_D4s_v3    centralus   2      65m
openshift-machine-api   qili-preserve-az-0516-8lmg7-master-1                  Running   Standard_D4s_v3    centralus   1      65m
openshift-machine-api   qili-preserve-az-0516-8lmg7-master-2                  Running   Standard_D4s_v3    centralus   3      65m
openshift-machine-api   qili-preserve-az-0516-8lmg7-worker-centralus1-kzd8v   Running   Standard_D4s_v3    centralus   1      54m
openshift-machine-api   qili-preserve-az-0516-8lmg7-worker-centralus2-xvsmx   Running   Standard_D4s_v3    centralus   2      54m
openshift-machine-api   qili-preserve-az-0516-8lmg7-worker-centralus3-95ws4   Running   Standard_D4s_v3    centralus   3      54m
openshift-machine-api   workload-centralus-jkpwh                              Running   Standard_D32s_v3   centralus   1      5m53s
% oc get machinesets -A
NAMESPACE               NAME                                            DESIRED   CURRENT   READY   AVAILABLE   AGE
openshift-machine-api   infra-centralus1                                1         1         1       1           6m22s
openshift-machine-api   infra-centralus2                                1         1         1       1           6m22s
openshift-machine-api   infra-centralus3                                1         1         1       1           6m22s
openshift-machine-api   qili-preserve-az-0516-8lmg7-worker-centralus1   1         1         1       1           66m
openshift-machine-api   qili-preserve-az-0516-8lmg7-worker-centralus2   1         1         1       1           66m
openshift-machine-api   qili-preserve-az-0516-8lmg7-worker-centralus3   1         1         1       1           66m
openshift-machine-api   workload-centralus                              1         1         1       1           6m13s

@qiliRedHat
Contributor Author

@paigerube14 Please take a look at #148 and let me know if you still want to pass the location as parameter.

@@ -6,7 +6,7 @@ metadata:
${MACHINESET_METADATA_LABEL_PREFIX}/cluster-api-cluster: ${CLUSTER_NAME}
${MACHINESET_METADATA_LABEL_PREFIX}/cluster-api-machine-role: workload
${MACHINESET_METADATA_LABEL_PREFIX}/cluster-api-machine-type: workload
name: workload-${CLUSTER_NAME}
Contributor


I think we still need to leave cluster_name in the name of the object to be able to properly run destroy.

Contributor Author


@paigerube14 Thanks for pointing this out. I added ${CLUSTER_NAME} to the names of all INFRA and WORKLOAD machinesets.

@paigerube14
Contributor

I tried with a cluster in the northcentralus region, and because that region has no availability zones, the infra and workload nodes failed to create. I think we can go ahead with this code because it makes things more robust. Maybe we need to add a common errors section to the README for this branch? Thoughts?

@qiliRedHat qiliRedHat force-pushed the cluster-post-config-fix-aure-location branch from 57dade4 to 9b2e1cf Compare May 17, 2022 04:10
@qiliRedHat
Contributor Author

The machineset names now look like this:

% oc get machinesets -A
NAMESPACE               NAME                               DESIRED   CURRENT   READY   AVAILABLE   AGE
openshift-machine-api   qili-az-d9sqx-infra-centralus1     1         1         1       1           24m
openshift-machine-api   qili-az-d9sqx-infra-centralus2     1         1         1       1           24m
openshift-machine-api   qili-az-d9sqx-infra-centralus3     1         1         1       1           24m
openshift-machine-api   qili-az-d9sqx-worker-centralus1    1         1         1       1           142m
openshift-machine-api   qili-az-d9sqx-worker-centralus2    1         1         1       1           142m
openshift-machine-api   qili-az-d9sqx-worker-centralus3    1         1         1       1           142m
openshift-machine-api   qili-az-d9sqx-workload-centralus   1         1         1       1           24m
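The naming pattern visible in this listing can be sketched as a small shell helper. This is only an illustration: `machineset_name` is a hypothetical function, not part of the repository's templates, and the `<cluster>-<role>-<region><zone>` layout is inferred from the output above.

```shell
# Hypothetical helper (not from the repo): build a machineset name as
# <cluster>-<role>-<region><zone>. The zone argument is optional, since the
# workload machineset above carries no zone suffix.
machineset_name() {
  printf '%s-%s-%s%s\n' "$1" "$2" "$3" "${4:-}"
}

machineset_name qili-az-d9sqx infra centralus 1   # qili-az-d9sqx-infra-centralus1
machineset_name qili-az-d9sqx workload centralus  # qili-az-d9sqx-workload-centralus
```

With the cluster name as a prefix, a cleanup step can select a cluster's machinesets by matching on that prefix instead of on generic names such as infra-centralus1.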

@qiliRedHat
Contributor Author

Found the monitoring cluster operator (co) degraded. The alertmanager and prometheus-k8s pods were not created successfully on the INFRA nodes.

Message:               Failed to rollout the stack. Error: updating alertmanager: waiting for Alertmanager object changes failed: waiting for Alertmanager openshift-monitoring/main: expected 2 replicas, got 0 updated replicas
updating prometheus-k8s: waiting for Prometheus object changes failed: waiting for Prometheus openshift-monitoring/k8s: expected 2 replicas, got 0 updated replicas

The root cause is that the storage class our job configures by default is not correct. I changed it to managed-csi, the default storage class on Azure, and everything works.

% oc get storageclass
NAME                    PROVISIONER          RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
azurefile-csi           file.csi.azure.com   Delete          Immediate              true                   58m
azurefile-csi-nfs       file.csi.azure.com   Delete          Immediate              true                   58m
managed-csi (default)   disk.csi.azure.com   Delete          WaitForFirstConsumer   true                   59m
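As a rough sketch, the default class could be read from that listing instead of being hard-coded. `default_storageclass` is a hypothetical helper, and it assumes the `(default)` marker appears as the second whitespace-separated field, as it does in the output above.

```shell
# Hypothetical helper: read `oc get storageclass` output on stdin and print
# the name of the class marked "(default)".
default_storageclass() {
  awk '$2 == "(default)" { print $1 }'
}

# Two rows copied from the listing above; prints "managed-csi".
default_storageclass <<'EOF'
azurefile-csi           file.csi.azure.com   Delete          Immediate              true                   58m
managed-csi (default)   disk.csi.azure.com   Delete          WaitForFirstConsumer   true                   59m
EOF
# On a live cluster: oc get storageclass --no-headers | default_storageclass
```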

Updated the description of the ENV_VARS for Azure to the correct storage class name.
Tested with this job successfully.
https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/scale-ci/job/qili-e2e-benchmark/job/cluster-post-config-fix-aure-location/10/

% oc get co --no-headers | grep -v 'True.*False.*False'
% oc get nodes | grep infra
qili-az-d9sqx-infra-centralus1-zhjwz     Ready    infra      34m    v1.23.3+69213f8
qili-az-d9sqx-infra-centralus2-q8bvl     Ready    infra      34m    v1.23.3+69213f8
qili-az-d9sqx-infra-centralus3-tw8q8     Ready    infra      34m    v1.23.3+69213f8
% oc get pods -o wide -n openshift-monitoring        
NAME                                                     READY   STATUS    RESTARTS   AGE    IP            NODE                                     NOMINATED NODE   READINESS GATES
alertmanager-main-0                                      6/6     Running   0          33m    10.131.2.10   qili-az-d9sqx-infra-centralus3-tw8q8     <none>           <none>
alertmanager-main-1                                      6/6     Running   0          33m    10.128.4.13   qili-az-d9sqx-infra-centralus2-q8bvl     <none>           <none>
cluster-monitoring-operator-fb7f97484-g2r7g              2/2     Running   0          157m   10.130.0.9    qili-az-d9sqx-master-0                   <none>           <none>
kube-state-metrics-566586664f-9g7q6                      3/3     Running   0          33m    10.128.4.10   qili-az-d9sqx-infra-centralus2-q8bvl     <none>           <none>
node-exporter-5chwb                                      2/2     Running   0          131m   10.0.0.6      qili-az-d9sqx-master-0                   <none>           <none>
node-exporter-76t22                                      2/2     Running   0          35m    10.0.128.8    qili-az-d9sqx-infra-centralus2-q8bvl     <none>           <none>
node-exporter-b8ttp                                      2/2     Running   0          131m   10.0.128.4    qili-az-d9sqx-worker-centralus2-x2d5f    <none>           <none>
node-exporter-cq67d                                      2/2     Running   0          131m   10.0.0.7      qili-az-d9sqx-master-1                   <none>           <none>
node-exporter-d9wbb                                      2/2     Running   0          131m   10.0.0.8      qili-az-d9sqx-master-2                   <none>           <none>
node-exporter-g968m                                      2/2     Running   0          131m   10.0.128.6    qili-az-d9sqx-worker-centralus3-xlhjl    <none>           <none>
node-exporter-qlf2v                                      2/2     Running   0          35m    10.0.128.9    qili-az-d9sqx-infra-centralus3-tw8q8     <none>           <none>
node-exporter-qx5w6                                      2/2     Running   0          131m   10.0.128.5    qili-az-d9sqx-worker-centralus1-5lqxf    <none>           <none>
node-exporter-sqnxb                                      2/2     Running   0          35m    10.0.128.10   qili-az-d9sqx-workload-centralus-54lw4   <none>           <none>
node-exporter-xpz6v                                      2/2     Running   0          35m    10.0.128.7    qili-az-d9sqx-infra-centralus1-zhjwz     <none>           <none>
openshift-state-metrics-78df6779c9-sm7cq                 3/3     Running   0          131m   10.129.2.10   qili-az-d9sqx-worker-centralus3-xlhjl    <none>           <none>
prometheus-adapter-b98448c67-8p4hv                       1/1     Running   0          33m    10.128.4.11   qili-az-d9sqx-infra-centralus2-q8bvl     <none>           <none>
prometheus-adapter-b98448c67-hbmpn                       1/1     Running   0          33m    10.129.4.7    qili-az-d9sqx-infra-centralus1-zhjwz     <none>           <none>
prometheus-k8s-0                                         6/6     Running   0          33m    10.128.4.12   qili-az-d9sqx-infra-centralus2-q8bvl     <none>           <none>
prometheus-k8s-1                                         6/6     Running   0          33m    10.129.4.8    qili-az-d9sqx-infra-centralus1-zhjwz     <none>           <none>
prometheus-operator-97985466-26j4g                       2/2     Running   0          34m    10.128.4.9    qili-az-d9sqx-infra-centralus2-q8bvl     <none>           <none>
prometheus-operator-admission-webhook-5c4885678b-mhvxd   1/1     Running   0          34m    10.128.4.8    qili-az-d9sqx-infra-centralus2-q8bvl     <none>           <none>
prometheus-operator-admission-webhook-5c4885678b-zb2px   1/1     Running   0          34m    10.131.2.5    qili-az-d9sqx-infra-centralus3-tw8q8     <none>           <none>
telemeter-client-74b9d948b7-nxwng                        3/3     Running   0          130m   10.131.0.17   qili-az-d9sqx-worker-centralus2-x2d5f    <none>           <none>
thanos-querier-955fcbfdd-9cmwn                           6/6     Running   0          131m   10.131.0.14   qili-az-d9sqx-worker-centralus2-x2d5f    <none>           <none>
thanos-querier-955fcbfdd-lhzrk                           6/6     Running   0          131m   10.129.2.13   qili-az-d9sqx-worker-centralus3-xlhjl    <none>           <none>
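The `oc get co` filter earlier in this comment printed nothing, which means every cluster operator reported Available=True, Progressing=False, Degraded=False. A column-based variant of that check might look like the sketch below. `unhealthy_operators` is a hypothetical helper, and it assumes the column order NAME VERSION AVAILABLE PROGRESSING DEGRADED with a non-empty VERSION column.

```shell
# Hypothetical helper: read `oc get co --no-headers` output on stdin and
# print the names of operators that are not Available=True,
# Progressing=False, Degraded=False.
unhealthy_operators() {
  awk '$3 != "True" || $4 != "False" || $5 != "False" { print $1 }'
}

# Illustrative input only (made-up rows, not output from this cluster).
# Prints "monitoring".
unhealthy_operators <<'EOF'
authentication   4.10.0   True    False   False   2h
monitoring       4.10.0   False   True    True    5m
EOF
# On a live cluster: oc get co --no-headers | unhealthy_operators
```

Comparing individual columns avoids accidental matches that a regular expression such as 'True.*False.*False' could make against other text on the line.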

@qiliRedHat
Contributor Author

@paigerube14 I merged the commits; please take another look. Thanks.

@qiliRedHat
Contributor Author

Opened #152 to fix the default value passed by cluster-workers-scaling

@qiliRedHat qiliRedHat force-pushed the cluster-post-config-fix-aure-location branch from 64eaf5b to c7d1796 Compare May 17, 2022 07:48
…, add cluster name to machieset name prefix, add comment to README that Azure only support region: centralus. Add infra and workload label to metadata of GCP machinesets

add infra and workload label to metadata of GCP machinesets
@qiliRedHat qiliRedHat force-pushed the cluster-post-config-fix-aure-location branch from 91ed48c to a2640ed Compare May 17, 2022 11:46
@qiliRedHat qiliRedHat changed the title Cluster post config fix Aure INFRA WORKLOAD name location and zone Fix Azure and GCP INFRA WORKLOAD configuration related issues May 17, 2022
@paigerube14
Contributor

/lgtm

@openshift-ci

openshift-ci bot commented May 17, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: paigerube14, qiliRedHat


@openshift-merge-robot openshift-merge-robot merged commit d7cfdac into openshift-qe:cluster-post-config May 17, 2022
@qiliRedHat qiliRedHat deleted the cluster-post-config-fix-aure-location branch May 18, 2022 11:11
3 participants